
Semi-Supervised Learning


What is Semi-Supervised Learning?

Semi-Supervised Learning is a type of machine learning that combines a small amount of labeled data with a large amount of unlabeled data. Because the unlabeled data helps the model generalize beyond what the labeled examples alone could teach it, the approach is especially useful when labeling data is expensive or time-consuming.

Types of Semi-Supervised Learning

Each type below is described in the same way: what it is, when it is used, when it is preferred over the other types, when it is not recommended, and example projects where it is a good fit.

Self-Training

What it is: A model is first trained on a small labeled dataset, then it predicts labels for the unlabeled data. The most confident predictions are added to the labeled set iteratively to improve the model (see Code 1 below).

When it is used:
• When you have a small labeled dataset and a large unlabeled dataset.
• When labeling data is expensive or time-consuming.
• Useful for text classification, image recognition, and speech tasks.

When it is preferred over other types:
• Better than Co-Training when you have only one view or feature set.
• Better than Graph-based Semi-Supervised Learning when a graph structure is unavailable.
• Better than Generative Semi-Supervised Learning when you prefer a simpler, iterative approach without complex generative models.

When it is not recommended:
• When predictions are not reliable; wrong labels may propagate errors.
• When you have multiple complementary feature views; Co-Training is better.
• When data naturally forms a graph or network structure; Graph-based methods perform better.
• When you can train a generative model easily; Generative SSL may give more robust results.

Example projects:
• Text classification with few labeled emails and many unlabeled emails.
• Image recognition with a small labeled image set and a large unlabeled dataset.
• Speech recognition using a few transcribed audio samples and many unlabeled audio files.
• Medical diagnosis with a few labeled patient records and many unlabeled records.

Co-Training

What it is: Two separate models are trained on different feature sets (views) of the same data. Each model labels unlabeled data for the other, iteratively improving both models (a minimal code sketch follows this subsection).

When it is used:
• When you have two or more complementary feature sets for the same data.
• When labeled data is scarce but unlabeled data is abundant.
• Common in web page classification, multimedia analysis, and text mining.

When it is preferred over other types:
• Better than Self-Training when you have multiple independent views; this reduces error propagation.
• Better than Graph-based SSL when the graph structure is unknown or hard to construct.
• Better than Generative SSL when you prefer a simpler iterative method without complex generative models.

When it is not recommended:
• When there is only one feature set; Co-Training cannot be applied.
• When views are not independent or are only weakly correlated; the models may reinforce each other's errors.
• When you can leverage graph structures; Graph-based SSL may be more effective.
• When you can train generative models easily; Generative SSL may give better performance.

Example projects:
• Web page classification using text content and hyperlink structure as separate views.
• Email spam detection using the email body and metadata as separate views.
• Multimedia classification using image features and audio features.
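
scikit-learn has no built-in co-training estimator, so the following is a minimal hand-rolled sketch on synthetic data. The two "views" are simply the first and last five features of the generated dataset, and the 0.9 confidence threshold and five training rounds are arbitrary illustrative choices, not recommendations.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data whose 10 features are split into two 5-feature "views"
X, y = make_classification(n_samples=400, n_features=10, n_informative=6, n_classes=2, random_state=42)
views = [X[:, :5], X[:, 5:]]

# 30% labeled, 70% unlabeled (as index arrays into the full dataset)
lab0, unlab0 = train_test_split(np.arange(len(y)), test_size=0.7, random_state=42)
lab = [lab0.copy(), lab0.copy()]  # each model keeps its own labeled pool
unlab = unlab0.copy()
y_work = y.copy()  # pseudo-labels for unlabeled points are written here

models = [LogisticRegression(max_iter=1000), LogisticRegression(max_iter=1000)]
for _ in range(5):  # a few co-training rounds
    for m in (0, 1):
        models[m].fit(views[m][lab[m]], y_work[lab[m]])
    if len(unlab) == 0:
        break
    for m in (0, 1):  # each model teaches the *other* model
        probs = models[m].predict_proba(views[m][unlab])
        confident = probs.max(axis=1) >= 0.9
        new = unlab[confident]
        y_work[new] = models[m].classes_[probs.argmax(axis=1)[confident]]
        lab[1 - m] = np.concatenate([lab[1 - m], new])
    unlab = np.setdiff1d(unlab, np.concatenate(lab))

# Evaluate by averaging both views' probabilities on the originally unlabeled points
p = (models[0].predict_proba(views[0][unlab0]) + models[1].predict_proba(views[1][unlab0])) / 2
pred = models[0].classes_[p.argmax(axis=1)]
print("Co-Training accuracy on unlabeled points:", accuracy_score(y[unlab0], pred))

Logistic regression here stands in for any base classifier; the essential co-training ingredient is that confident labels produced from one view expand the training pool of the model that uses the other view.
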
Graph-based Semi-Supervised Learning

What it is: The data is represented as a graph, where nodes are samples and edges represent similarity or relationships. Labels propagate from labeled nodes to unlabeled nodes through the graph structure (see Code 2 below).

When it is used:
• When data naturally forms a graph or network structure (e.g., social networks, citation networks).
• When you have few labeled nodes and many unlabeled nodes connected via similarity.
• Common in node classification, link prediction, and network analysis.

When it is preferred over other types:
• Better than Self-Training when the relationships between data points are crucial, not just iterative labeling.
• Better than Co-Training when you don't have multiple independent feature sets but do have graph connectivity.
• Better than Generative SSL when you want direct label propagation rather than building a generative model.
• Ideal for networked or relational datasets.

When it is not recommended:
• When data cannot be represented as a meaningful graph.
• When only one small feature set is available and connectivity is weak; Self-Training may suffice.
• When building the graph is computationally expensive, as with very large datasets.
• When a generative approach is more natural or effective.

Example projects:
• Node classification in a citation network (classify papers by topic).
• Fraud detection in transaction networks.
• Social network analysis (predict user interests based on friends).
• Protein-protein interaction prediction in bioinformatics.

Generative Semi-Supervised Learning

What it is: A generative model learns the joint distribution of inputs and labels. It can generate synthetic data and assign labels to unlabeled data to improve learning (a minimal code sketch follows this subsection).

When it is used:
• When you have few labeled samples and a large amount of unlabeled data.
• When modeling the data distribution can help improve classification.
• Common in image generation, text generation, and speech tasks.

When it is preferred over other types:
• Better than Self-Training when you want a probabilistic model that captures the data distribution.
• Better than Co-Training when multiple views are unavailable.
• Better than Graph-based SSL when data is not naturally a graph.
• Ideal when generative modeling is feasible and can improve semi-supervised learning.

When it is not recommended:
• When training a good generative model is difficult.
• When data has a complex structure or high dimensionality that is hard to model.
• When simpler methods (Self-Training or Co-Training) suffice.
• When the dataset is small and a generative model may overfit.

Example projects:
• Semi-supervised image classification using a Variational Autoencoder (VAE).
• Text classification with few labeled texts and a generative language model.
• Speech recognition using generative models to label unlabeled audio.
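
The example projects above mention VAEs and language models, but the generative idea can be shown with far less machinery. The sketch below fits one Gaussian mixture component per class, letting EM run over labeled and unlabeled data together, and then maps each component back to a class by a majority vote over the labeled points. It is a toy stand-in for richer generative models, and it assumes each component ends up capturing at least some labeled points.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score

# Synthetic data: 30% labeled, 70% unlabeled
X, y = make_classification(n_samples=400, n_features=5, n_informative=3, n_classes=2, random_state=42)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(X, y, test_size=0.7, random_state=42)

# One mixture component per class; means start at the labeled class means,
# then EM refines them on all data, which is where the unlabeled points help
means_init = np.vstack([X_lab[y_lab == c].mean(axis=0) for c in (0, 1)])
gmm = GaussianMixture(n_components=2, means_init=means_init, random_state=42)
gmm.fit(np.vstack([X_lab, X_unlab]))

# Map each component to the class that dominates it among the labeled points
comp_lab = gmm.predict(X_lab)
comp_to_class = {c: np.bincount(y_lab[comp_lab == c]).argmax() for c in (0, 1)}

# Label the unlabeled points through the learned generative model
pred = np.array([comp_to_class[c] for c in gmm.predict(X_unlab)])
print("Generative SSL accuracy on unlabeled points:", accuracy_score(y_unlab, pred))
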

Code 1 (Pseudo-labeling [Self-training])


import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1️⃣ Create a synthetic dataset; keep labels for only 30% of it
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, n_classes=2, random_state=42)
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X, y, test_size=0.7, random_state=42)  # 30% labeled, 70% treated as unlabeled

# 2️⃣ Train the initial model on the labeled data only
model = RandomForestClassifier(random_state=42)
model.fit(X_labeled, y_labeled)

# 3️⃣ Predict pseudo-labels for the unlabeled data
pseudo_labels = model.predict(X_unlabeled)

# 4️⃣ Combine the labeled data with the pseudo-labeled data
X_combined = np.vstack((X_labeled, X_unlabeled))
y_combined = np.hstack((y_labeled, pseudo_labels))

# 5️⃣ Retrain the model on the combined dataset
model.fit(X_combined, y_combined)

# 6️⃣ Evaluate on the unlabeled points against their true labels
# (possible here only because the data is synthetic)
y_pred = model.predict(X_unlabeled)
accuracy = accuracy_score(y_unlabeled, y_pred)
print("Accuracy after pseudo-labeling:", accuracy)

Code 2 (Label Propagation [Graph-based SSL])


import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1️⃣ Create dataset
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, n_classes=2, random_state=42)

# 2️⃣ Split into labeled and unlabeled portions, keeping the true labels aside
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X, y, test_size=0.7, random_state=42)
y_true = np.hstack((y_labeled, y_unlabeled))  # ground truth, in the combined order

# 3️⃣ Mark unlabeled points with -1, as required by LabelPropagation
y_combined = np.hstack((y_labeled, np.full(len(y_unlabeled), -1)))
X_combined = np.vstack((X_labeled, X_unlabeled))

# 4️⃣ Fit the Label Propagation model
label_prop_model = LabelPropagation()
label_prop_model.fit(X_combined, y_combined)

# 5️⃣ Read off the propagated labels for all points (transduction)
y_pred = label_prop_model.transduction_

# 6️⃣ Evaluate against the true labels, in the same row order as X_combined
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy after Label Propagation:", accuracy)